Skip to content

Bound Ratis reconfiguration retries and add region migration ITs#17895

Merged
CRZbulabula merged 4 commits into
masterfrom
enhance-ratis-reconfiguration-retry-limit
Jun 10, 2026
Merged

Bound Ratis reconfiguration retries and add region migration ITs#17895
CRZbulabula merged 4 commits into
masterfrom
enhance-ratis-reconfiguration-retry-limit

Conversation

@CRZbulabula

@CRZbulabula CRZbulabula commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Problem

When a DataNode that hosts a newly ADDING peer is killed during a Ratis region migration, the Ratis configuration change can stay in the staging state. The previous reconfiguration client retried these failures forever, so the DataNode add-peer task could remain PROCESSING indefinitely and the ConfigNode migration procedure could not fail and roll back.

Changes

  • Bound Ratis reconfiguration retries with a new reconfigurationMaxRetryAttempts client setting. Reconfiguration requests still use a fixed 2s retry interval, but now stop after the configured attempt count and propagate the last failure to the caller.
  • Renamed the reconfiguration retry client factory/policy from the old endless-retry naming to ReconfigurationRetryFactory and RatisReconfigurationRetryPolicy.
  • Added per-scope retry knobs:
    • config_node_ratis_reconfiguration_max_retry_attempts
    • schema_region_ratis_reconfiguration_max_retry_attempts
    • data_region_ratis_reconfiguration_max_retry_attempts
  • Set the default retry count to 15, which is roughly 30 seconds with the fixed 2s interval. This keeps the timeout close to the default ConfigNode Unknown detection window while still leaving a small buffer for transient delays.
  • Propagated schema/data reconfiguration retry settings from ConfigNode to DataNodes through optional TRatisConfig fields. The fields are optional so rolling upgrades can fall back to local DataNode defaults when an old ConfigNode does not send them.
  • Wired the new retry setting through ConfigNode, DataNode, and the Ratis consensus config builders for ConfigNode, schema-region, and data-region consensus.
  • Added integration-test config setters for the new Ratis reconfiguration retry knobs.
  • Extended the region migration reliability IT framework with a failure-and-rollback path that waits for kill points, verifies the region replica set returns to the original DataNode set, and verifies the target peer is cleared.
  • Added a Ratis DataNode kill point during destination local peer creation so the ADDING-peer crash path can be reproduced.

Ratis IT Coverage

The branch adds Ratis region migration ITs to match the IoTV1 coverage pattern and cover Ratis-specific consensus-pipe states:

  • DataNode crash while creating the destination local peer, expecting migration failure and rollback.
  • ConfigNode crash at add-region states:
    • CREATE_NEW_REGION_PEER
    • CREATE_CONSENSUS_PIPES
    • DO_ADD_REGION_PEER
    • UPDATE_REGION_LOCATION_CACHE
  • ConfigNode crash at remove-region states:
    • TRANSFER_REGION_LEADER
    • REMOVE_REGION_PEER
    • DELETE_OLD_REGION_PEER
    • DROP_CONSENSUS_PIPES
    • REMOVE_REGION_LOCATION_CACHE
  • ConfigNode crash with all add/remove migration kill points enabled in one migration.
  • Cluster crash at the IoTV1-covered migration states, plus Ratis CREATE_CONSENSUS_PIPES and DROP_CONSENSUS_PIPES.

The Ratis IT setup uses reconfiguration retry count 10 to keep failure-path tests bounded. The new Ratis IT classes are now tagged with DailyIT after passing CI.

Verification

  • mvn spotless:apply -pl integration-test -P with-integration-tests
  • mvn test-compile -pl integration-test -P with-integration-tests -DskipTests
  • mvn verify -DskipUTs -Dit.test=IoTDBRegionMigrateAddingPeerCrashForRatisIT,IoTDBRegionMigrateConfigNodeCrashForRatisIT,IoTDBRegionMigrateClusterCrashForRatisIT -DintegrationTest.includedGroups= -DintegrationTest.excludedGroups= -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -P with-integration-tests
    • 19 tests passed.
  • mvn spotless:check -pl integration-test -P with-integration-tests
  • git diff --check

When scaling in a DataNode, if a peer that is being ADDED to a Ratis
schema/data region (during region migration) is killed before it catches
up, the leader's reconfiguration can never commit. The Ratis client used
for reconfiguration retried forever (retryForeverWithSleep), so the
coordinator DataNode's AddRegionPeerTask blocked indefinitely inside
setConfiguration, leaving the migration permanently stuck -- and CANCEL
ineffective, since the task never left PROCESSING.

Bound the reconfiguration retries instead of retrying forever: after the
limit is exhausted the last failure propagates, so the upper layer can
fail and roll back the migration (and CANCEL becomes reachable again).

Expose the limit as a per-group config, pushed from ConfigNode to
DataNode via TRatisConfig (optional fields, for rolling-upgrade safety):
- config_node_ratis_reconfiguration_max_retry_attempts
- schema_region_ratis_reconfiguration_max_retry_attempts
- data_region_ratis_reconfiguration_max_retry_attempts
Default 600 attempts at the fixed 2s interval (~20min cap).

Also rename the now-misnamed EndlessRetryFactory /
RatisEndlessRetryPolicy to Reconfiguration*, since the policy is no
longer endless.
@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 38.02817% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (07b9cb0) to head (9f02f64).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...he/iotdb/confignode/conf/ConfigNodeDescriptor.java 0.00% 12 Missing ⚠️
...apache/iotdb/confignode/conf/ConfigNodeConfig.java 25.00% 9 Missing ⚠️
...ol/thrift/impl/DataNodeInternalRPCServiceImpl.java 0.00% 7 Missing ⚠️
...java/org/apache/iotdb/db/conf/IoTDBDescriptor.java 0.00% 6 Missing ⚠️
...che/iotdb/confignode/manager/node/NodeManager.java 0.00% 4 Missing ⚠️
...ain/java/org/apache/iotdb/db/conf/IoTDBConfig.java 50.00% 4 Missing ⚠️
...confignode/manager/consensus/ConsensusManager.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17895      +/-   ##
============================================
+ Coverage     40.68%   40.74%   +0.06%     
- Complexity     2620     2621       +1     
============================================
  Files          5244     5244              
  Lines        362374   362696     +322     
  Branches      46653    46687      +34     
============================================
+ Hits         147419   147793     +374     
+ Misses       214955   214903      -52     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@CRZbulabula CRZbulabula changed the title Bound Ratis reconfiguration retries to avoid stuck region migration Bound Ratis reconfiguration retries and add region migration ITs Jun 10, 2026
Comment on lines +321 to +330
/**
* RatisConsensus protocol, max retry attempts for a configuration change (add/remove peer). Uses
* a fixed 2s retry interval; bounding the attempts stops a killed ADDING peer from blocking the
* reconfiguration -- and hence a region migration -- forever.
*/
private int configNodeRatisReconfigurationMaxRetryAttempts = 600;

private int dataRegionRatisReconfigurationMaxRetryAttempts = 600;
private int schemaRegionRatisReconfigurationMaxRetryAttempts = 600;

@jt2594838 jt2594838 Jun 10, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will 20 minutes be too long? May use a similar interval, referring to how CN determines whether a node is unknown.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. 20 minutes is too long for this failure path. I changed the default reconfiguration retry attempts from 600 to 15, which is about 30 seconds with the fixed 2s retry interval. This keeps it close to the ConfigNode Unknown detection window while leaving a small buffer for transient delays. The newly added Ratis region migration ITs have also been moved into DailyIT after the CI passed.

@sonarqubecloud

Copy link
Copy Markdown

@CRZbulabula CRZbulabula merged commit 3030597 into master Jun 10, 2026
44 of 46 checks passed
@CRZbulabula CRZbulabula deleted the enhance-ratis-reconfiguration-retry-limit branch June 10, 2026 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants